Social complexity, life-history and lineage influence the molecular basis of castes in vespid wasps

A key mechanistic hypothesis for the evolution of division of labour in social insects is that a shared set of genes co-opted from a common solitary ancestral ground plan (a genetic toolkit for sociality) regulates caste differentiation across levels of social complexity. Using brain transcriptome data from nine species of vespid wasps, we test for overlap in differentially expressed caste genes and use machine learning models to predict castes using different gene sets. We find evidence of a shared genetic toolkit across species representing different levels of social complexity. We also find evidence of additional fine-scale differences in predictive gene sets, functional enrichment and rates of gene evolution that are related to level of social complexity, lineage and of colony founding. These results suggest that the concept of a shared genetic toolkit for sociality may be too simplistic to fully describe the process of the major transition to sociality.

Lines 381: this almost certainly has to do with the use of brain transcriptomes. Its very possible that the patterns found in the brain would be flipped for other tissues. In fact, Jasper et al (2014) suggests this, as the brain is the least derived tissue in terms of novelty at the sequence level, at least in the honey bee. Thus, the argument that life history and lineage are important are probably true, but tissue specificity of the results cannot be neglected and should be discussed at some point in the paper.
Line 476: Kapheim et al 2015 should be cited and discussed as it also shows difference in genomic evolution with grade of eusoical complexity.
Lines 511-526: another interpretation is that this result is dependent on the tissue, brain, and might change when different tissues are analyzed Reviewer #3: Remarks to the Author: This study uses comparative transcriptomics to test the hypothesis that the genes responsible for regulating caste behavior are similar across species that vary in social complexity. It is clear that the authors put a great deal of care into species selection to ensure they had a wide representation of social organization within one family of wasps. This is the most expansive comparison of social evolution in wasps from a transcriptomics perspective, which is no small feat given that there are so few publicly available wasp genome assemblies. This study thus provides an important point of comparisons for other comparative studies of social evolution that have focused on bees or ants. While this project thus holds an important place in the field, I have two major critiques of the paper that would need to be addressed prior to drawing conclusions based on the results presented here.
First, there are some methodological limitations that make it difficult to place too much confidence in the results. Most concerning is the details of the sample collection. It would be nice to have a sense of how the samples were collected and their sample sizes to get a sense of whether queen-worker behaviors/states/dynamics were likely to be captured in a similar way between species. For example, if workers in some species happened to be foraging and others happened to be sleeping or engaged in aggression at the time of collection, this could impact the degree to which gene expression patterns would represent queen/worker differences in a general sense. Similarly, if the colonies were at different points in the colony cycle, this could have a major effect on brain gene expression differences between castes. Moreover, it sounds from the Methods that queens and workers were identified based solely on ovary activation and sperm presence. This seems like an imprecise, if not error-prone, method given that some workers in some eusocial species can mate and lay eggs or there could be gynes with unactivated ovaries in the nest. Ultimately, whole brain gene expression profiles are a snapshot in time, and it is difficult to know how representative these are of the processes regulating caste differences more generally.
Providing sample sizes would also give the reader an idea of how much variance was likely to play a factor in the differential gene expression differences between species. Table S1 seems to indicate that several species had just one biological replicate of pooled samples for queens and workers. This seems counter to what is described in the Methods, which seems to indicate there are replicates of 2-4 pooled samples per caste/species, so some additional clarity could be helpful. This n= 1 seems to be confirmed later in the Methods (L763). If this is truly the case that each species was represented by a single pooled sample of Queens and a single pooled samples of Workers, it seems difficult to draw too many comparative conclusions from the analyses. Although I am not an expert on SVM methods, 1 replicate of 8 species (or fewer in the case of the follow-up analyses) seems like an extremely small training set for machine learning. A quick google search suggests that a 100 or fewer datapoints is considered a small training set. This dataset does not seem to come close to this. Nonetheless, it seems there are some basic steps that can be taken to determine if the training set is large enough, such as plotting learning curves of training set size vs classification errors. Was this done?
One additional point on methods, or the approach more generally, is that so few aspects of each species' life history were taken into consideration. There are many ecological variables that could influence gene expression or rates of molecular evolution. A comprehensive way to address this would be to include several axes of ecological and social variation in the models, but at a minimum they should be tested or discusses as alternative explanations for the observed data. For example, additional aspects of biological variation are considered toward the end of the Results section when the same set of analyses are applied to colony founding method (swarm vs independent). However, this seems like an afterthought and still leaves plenty of other ecological differences (e.g., tropical vs temperate, type of nest, diet, etc) ignored. Fig. 1 does not provide any information about the rest of ecology of the species.
A second major limitation of the paper is the presentation of the results. In several places, it was difficult to determine what exactly had been done and what significance should be place on the results being described. In many places, this could be easily fixed by (a) presenting a clear hypothesis and prediction prior to describing the result and (b) providing a few additional details about the methods that were used to generate the result. Some examples are below.
I had a hard time following the results of the SVM analysis, because some of the statistical terms used were unfamiliar to me. What does it mean to select features based on a linear regression from left to right? What does 0.6 likelihood mean? Is this like probability? So 60% of samples would likely be classified correctly? (This seems to be confirmed later in the Methods, but it would have been useful while reading the results.) Also, it seems like the method was used to see how well Queens could be classified. Was this also performed for Workers? I wonder if Workers would be more or less easy to classify.
I was more excited about the use of SVM to train a classifier on species with simple societies and see how well it predicts caste in more complex societies. Not withstanding my questions about classification confidence above, it seems this test performed fairly well. It is intriguing that the reverse did not perform well, which is what you might expect given hypotheses presented in previous literature (e.g., Johnson Linksvayer 2010 Q Rev Biol). In the Discussion this result was described as "unexpected", but it seems to me the Johnson & Linksvayer paper suggests this is exactly what one would expect, given the degree of specialization and emergent properties that emerge in the "superorganismal" species. However, I noted that few statistical results were provided in the text when describing this result and looking at Fig. 5b the scores do not look that bad! It might help bolster the conclusion if additional details were provided in the text.
I also had some difficulty connecting the results of the dN/dS analysis to the results based on gene expression. I think this section could benefit from some clear hypotheses and predictions. This would help the reader understand how the dN/dS results are meant to be interpreted. In some places dN/dS rates >= 1 are taken to indicate positive selection and in other places they are more accurately described as indicators of more rapid evolution, which could of course include neutral evolution. Some additional description of the methods in the main text might help to justify these inferences.
The sentence "Some orthogroups had experienced significant positive selection (i.e., on the given foreground branch, null model rejected, chi-square test p < 0.05, dNdS => 1; a total of eleven orthogroups)" was especially confusing to this reviewer. Does this mean there were a total of 11 orthogroups with dN/dS > 1 for all 9 iterations of an individual species on the foreground branch? Or this is the number that was common to all 9? Some additional detail could help to clarify, particularly if placed within a hypothesis-prediction framework. I found the sentences following that describing the individual orthogroups to be difficult to interpret, partially because it was not clear what these orthogroups really signify and so it was difficult to become too invested in learning their (putative and inferred) functions.
The conclusion of this whole section is that "the molecular processes regulating caste differentiation vary depending on the level of social complexity". However, I am unclear as to how the dN/dS rates of single copy orthologs pertain to the regulation of caste differentiation. How do dN/dS rates map onto other features of each species ecology? Is there justification for interpreting rates of evolution solely through the lens of caste differentiation? Generally, I think this section could have much higher impact and clarity if some clear hypotheses and predictions were presented.
Overall, this study represents a lot of important work and the genomic resources alone will be highly valuable for the field. A comprehensive comparative study of wasp transcriptomics is highly valuable in a field that has been dominated by ant and bee research. That said, there are some major limitations that will need to be addressed before this reviewer feels confident in drawing too many conclusions from the results.

Response to Reviewers 1 2
Thank you for your time revising our manuscript. This has been extremely helpful 3 and we hope we have adequately answered your concerns. 4 5 Specifically, we have rewritten the introduction and results sections to improve the 6 flow of the paper (including the addition of hypotheses, as suggested by reviewer 3). 7 Also, we have taken care to improve the methods specifically and conducted new 8 analyses to show that the data are reliable, yet acknowledge the limitations of our 9 experimental design and the conclusions we can take form them. 10 11 Finally, we highlight the addition of a Nextflow workflow (not in the original 12 manuscript) to run the SVM analysis from this paper (Github: This manuscript is the product of considerable and sophisticated work. The authors 33 state to have found evidence of a shared genetic toolkit across nine wasp species 34 representing different levels of social complexity, but also evidence for shifts in the 35 gene networks regulating social behaviour and rates of gene evolution that are 36 influenced by innovations in both social complexity and life-history. I agree, and this 37 manuscript will be of interest to a general reader. However, I am troubled by their 38 outliers, Brachygastra and Agelaia, which clustered in opposite directions in the first 39 analysis. According to the authors, these two taxa exhibit the most social complexity 40 among their Polistinae. Why should this be so? This leads me to infer that something 41 is being missed by the authors. I should like to see the authors offer more than 42 "These two outlier species may not share the same caste-specific patterns as the 43 other species, but we cannot rule out data and/or sampling 44 anomalies." 45 46 Thank you for your review, we are pleased you highlighted the relevance and effort 47 gone into our publication. We really appreciate the time you have spent reviewing 48 our paper. 49 50 Rev 1 Comment 1: With regards to the two outlier species (Brachygastra and 51 Agelaia), we agree that these deserve more discussion of possible biological and 52 technical explanations. We now provide information on how we ruled out possible 53 technical explanations for these outliers. Below is a brief summary of what we've 54 added.

56
First, we determined whether there could be any concerns with respect to the quality 57 of the sequence data. We conducted a FASTQC analysis, measuring quality 58 statistics for each of our input reads (this has been added to Supp "Top_phylum" and "Top_genus"). This analysis shows that the majority of assembled 92 translated genes appear to be from arthropods; moreover, by genus, we mainly find 93 wasps with the correct genus level mappings expected, with little contamination from 94 other species. Where sequences matched closer to non-arthropod sequences 95 (Supp . Table 1; tab: "Top_phylum"), the majority come from proteobacterial genes 96 (the most common brain bacteria), fungi (inc. Wolbachia) and occasional 97 mammalian (inc. Human) sequences. However, these small levels of contamination 98 do not affect our analysis because we focus on orthologous genes across all nine 99 species -any contaminants in one or a few species would by definition not be 100 included. This is now referred to in the main text (Rev unlikely as all species were collected in a similar way: Agelaia was even collected in 105 the same place and time as other species (e.g. Metapolybia). However, as we note 106 in the methods, we were only able to sample one colony of Agelaia, and so we 107 cannot rule out the possibility that something was unusual about this colony (e.g. lost 108 its queens; recently predated; stage in colony cycle). This is included in the collection 109 methods (comment: Rev 1 Comment 1.2, line 706), and is now mentioned as a 110 possible explanation in the discussion (Rev 1 Comment 1.3, line 593).

112
In summary, we can see no technical reasons to exclude any samples due to any 113 sampling or processing anomalies, and no technical explanations for why Agelaia 114 and Brachygastra appear to be outliers in our analysis. These are explained in the 115 results section (Rev 1 Comment 1.4, line 510-517), and we now have included a full 116 table outlining all the quality scores (within Supplementary Table 1).

118
In the absence of a technical explanation, let's now consider possible biological 119 explanations for why these two species are outliers. The reviewer rightly notes that 120 they cluster differently from the other species; this suggests that these two species 121 do not use the "toolkit" genes for caste differentiation in the same way as the other 122 wasp species. What is biologically different about these two species that could 123 explain this? There is no clear phylogenetic explanation as they are epiponines, 124 along with Angiopolybia and Metapolybia; it is unlikely that any environmental 125 differences can be an explanation as these two species occur in similar 126 tropical/subtropical conditions to all the other Polistines used in our study and in the 127 case of Agelaia they were sampled concurrently (in the same general location and 128 time period) as several of the other species. The main difference is that Agelaia and 129 Brachygastra represent intermediate levels of social complexity, as explained in 130 Figure 1: their societies are not as simple as Polistes, Mischocyttarus, Angiopolybia 131 or Metapolybia, but they are less socially complex than the superorganismal 132 vespines. Our analysis suggests that the same conserved social toolkit was likely 133 present in the common ancestor, and is also present in the highly complex societies. 134 but may be less important in species that are intermediate in complexity. One 135 interpretation of this is that a different set of genes regulates caste differentiation in 136 these intermediate forms of social complexity. If these species are indeed proxies for 137 the different stages of the major transition to superorganismality (as is generally 138 thought), then it is possible that Agelaia and Brachygastra represent a stage in the 139 major evolution transition that involves a possible upheaval of molecular processes 140 regulating castes, where pre-imaginal caste commitment is partially apparent. 141 Alternatively, these forms of social complexity may be unconnected with any 142 evolutionary patterns, and are an example of where a conserved molecular toolkit for 143 castes is not upheld. In this case, it is very interesting (albeit unexplained) that these 144 societies with intermediate levels of social complexity are regulated using different 145 processes to other types of societies. We are unable to address these ideas further 146 without wide taxonomic sampling. This is outside the scope of this study, but we 147 have included a new section in the discussion to address this.

149
Currently, the most parsimonious explanation is that there is a biological explanation 150 for these samples being a little different to the other 7 species but that further 151 sampling would be required before we can be sure how to interpret these outliers. 152 Overall, though, our main conclusion stands: we have found a robust set of genes 153 that are caste-biased across a wide diversity of social wasps. This remains the first 154 multi-species transcriptomic analysis of its kind in wasps. We hope it will stimulate 155 further genomic analyses of this interesting group 156 157 Rev 1 Comment 2: There are a number of minor errors that are probably inevitable 158 with a mansucript consisting of so many elements: Sumner (2014) appears twice in 159 the literature citations; "were did" on line 492; 160 Thanks, these issues have been fixed. General Points 174 175 The submitted study explores eusocial evolution using transcriptomics and machine 176 learning. An understudied group (wasps other than polistes) are studied and 177 relatively novel methods based on machine learning are used to good effect to show 178 that a conserved toolkit of genes underlies caste differences, but is not the only 179 explanatory model necessary. The authors highlight the role of lineage and life 180 history as well as social complexity in defining the nature of genomic evolution. 181 Overall, I think this is a great study and worthy of publication in this journal. My 182 comments relate to only two points. First, the citation record is weak in some places, 183 and second, some of the conclusions drawn may depend on the choice of tissue 184 (brain) used in the study. Some discussion of the possibility that the results may be 185 tissue specific is necessary. 186 187 Thank you for your kind words about our publication. We really appreciate the time 188 you have spent reviewing our paper. We now provide all the cutoffs to make this part clearer -we have added a histogram 259 to off. This highlights the fact that we do not recover many differentially expressed 261 genes across all nine species. Our use of 2+ and 4+ cut-offs were arbitrary. We 262 chose 2+ (original plot 3B) for the Gene Ontology (GO), so that we included sufficient 263 genes to run the analysis, and we showed the names and expression patterns of the 264 4+ genes (original plot 3C) due to space available in the plot (3+, would be 166 lines, 265 too many to show clearly on a page, with minimal font size). We provide the full list 266 in Supplementary Table S3. To be consistent, we added a new GO plot (part B-267 using DEGs in any direction; n=552; before it was in the same direction) and now 268 highlight which plots are coming from the cumulative values in the new plot 3A. The 269 terms are similar to the previous result, but now we can clearly show in the figure  270 where these gene sets come from 271 272 273 Rev 2 Comment 8: Lines 381: this almost certainly has to do with the use of brain 274 transcriptomes. Its very possible that the patterns found in the brain would be flipped 275 for other tissues. In fact, Jasper et al (2014) suggests this, as the brain is the least 276 derived tissue in terms of novelty at the sequence level, at least in the honey bee. 277 Thus, the argument that life history and lineage are important are probably true, but 278 tissue specificity of the results cannot be neglected and should be discussed at 279 some point in the paper.

281
The reviewer is referring to our finding that rates of evolution were higher in the 282 simpler societies than the more complex ones, and that this is counter to previous 283 studies on bees which found higher rates in the more complex social species. This 284 cannot be explained by the tissue choice, as the study we are comparing our data 285 with (Shell et al 2021) was also on brains (whole heads), as indeed are many other 286 studies on DEG in social insect castes. However, the choice of tissue is an important 287 one (raised earlier by this reviewer), and certainly brain tissue appears to have fewer 288 highly expressed differentially expressed genes compared to other tissues. As 289 explained above, we have now added some text to justify our choice and 290 acknowledging the possible limitations (see "Rev 2 Comment 2" changes in MS). This has now been cited (Rev 2 Comment 9, line 519) 295 296 Rev 2 Comment 10: Lines 511-526: another interpretation is that this result is 297 dependent on the tissue, brain, and might change when different tissues are 298 analyzed.

299
We Similarly, if the colonies were at different points in the colony cycle, this could have a 334 major effect on brain gene expression differences between castes. Moreover, it 335 sounds from the Methods that queens and workers were identified based solely on 336 ovary activation and sperm presence. This seems like an imprecise, if not error-337 prone, method given that some workers in some eusocial species can mate and lay 338 eggs or there could be gynes with unactivated ovaries in the nest. Ultimately, whole 339 brain gene expression profiles are a snapshot in time, and it is difficult to know how 340 representative these are of the processes regulating caste differences more 341 generally.

343
We had already provided quite some detail about the collection methods, sample 344 sizes (both number of colonies and number of individuals) and how we characterised 345 the castes for each species. Specifically: 346 -the first paragraph of the Methods section entitled 'Sample collection" (Rev 3 347 Comment 1.1, line 701) gives the number of colonies sampled for each 348 species; these are also listed in Suppl. table S1. In the same section we 349 explain how samples were pooled to generate caste-specific transcriptomes; 350 again, this includes sample sizes but also this is given in complete detail in 351 Suppl. Table S1.

352
-We also provided some extensive morphometrics datasets and analyses of 353 three species (Brachygastra, Polybia and Metapolybia) as explicit data on the 354 level of caste allometry were lacking in the literature -these are found in 355 Supplementary  1.7, lines 727-738).

376
In this way we can be certain that we didn't select reproductive workers, 377 gynes or callows. It is worth noting here that it is impossible to collect queens 378 from any of the epiponines or vespines without opening the nest because of 379 the envelop that surrounds them. There is therefore no other way to identify 380 these castes, other than through ovarian dissections. Indeed, the entire 381 literature on these insects is based on assigning caste from dissections. It is 382 hard to see how else this could have been done. With over 20 years of 383 experience working with social wasps (including dissecting them and studying 384 caste behaviours), we are very confident in our assignment of caste. The only 385 possible anomaly is Agelaia, where (as the reviewer points out) we only 386 sampled from one colony as they are hard to find. Indeed, this could explain 387 why they were somewhat of an outlier (see response to Reviewer 1); we 388 discuss this in the Discussion now (see text highlighted Rev 1 Comment 1). 389 390 391 Rev 3 Comment 2: Providing sample sizes would also give the reader an idea of 392 how much variance was likely to play a factor in the differential gene expression 393 differences between species. Table S1 seems to indicate that several species had 394 just one biological replicate of pooled samples for queens and workers. This seems 395 counter to what is described in the Methods, which seems to indicate there are 396 replicates of 2-4 pooled samples per caste/species, so some additional clarity could 397 be helpful. This n= 1 seems to be confirmed later in the Methods (L763). If this is 398 truly the case that each species was represented by a single pooled sample of 399 Queens and a single pooled samples of Workers, it seems difficult to draw too many 400 comparative conclusions from the analyses.

402
It is a difficult trade-off in these types of analyses when resources are limited: 403 replication within species versus between. For our question, it was important to 404 maximise power in making comparisons across species, rather than within species. 405 We have now included in the results (mentioning "individuals and pools" were used 406 [Rev 3 Comment 2.1, line 158] and that "Using these data we could construct de 407 novo brain transcriptomes and estimate gene expression for a pool of queens and a 408 pool of workers for each species." [ 3.2, lines 926-929)). Using 6 448 fold cross validation of eight species (repeated; with one species left out), that the 449 cross validation error using all the data was around 0.9 (showing that the predictions 450 were not correct, using all the data). Yet, when you use feature selection the error 451 reduced to ~<0.05 when using ~<20% of the most informative genes (after feature 452 selection), in the nine replicates with a species left out. Having error scores lower 453 that 0.05 show that the svm model is accurate and consistent across different input 454 datasets. We hope this helps show that the data are consistent in their predictive 455 power over the different feature selection filters.

457
We propose that it is sufficient to make our conclusions by training on 8 species and 458 testing on the ninth. We argue that even with our low number of samples, given we 459 are using gene expression data with thousands of features, that we can make 460 accurate predictions. For example, many biomedical papers (e.g. 461 https://www.sciencedirect.com/science/article/abs/pii/S0169260714003228?via%3Di 462 hub) have less than 40 features per sample (instead of thousands in our dataset).

463
This is highlighted in the result shown in Figure 4a, where after feature selection, we 464 get the expected predictions of caste in the majority of species, suggesting that there 465 is sufficient signal in the training sets to make a prediction. However, these 466 classification estimates rarely get above 0.6, suggesting that these predictions are 467 not clear cut, almost certainly due to the low sample number in the analysis.

469
Ultimately, SVMs are powerful at detecting small changes across large number of 470 samples, yet they can still perform well enough even with small datasets. We add a 471 comment in the methods and discussion acknowledging the concerns about using this 472 approach on smaller datasets (Rev 3 Comment 3.3, lines 499-501).

474 475
Rev 3 Comment 4: additional point on methods, or the approach more generally, is 476 that so few aspects of each species' life history were taken into consideration. There 477 are many ecological variables that could influence gene expression or rates of 478 molecular evolution. A comprehensive way to address this would be to include 479 several axes of ecological and social variation in the models, but at a minimum they 480 should be tested or discusses as alternative explanations for the observed data. For 481 example, additional aspects of biological variation are considered toward the end of 482 the Results section when the same set of analyses are applied to colony founding 483 method (swarm vs independent). However, this seems like an afterthought and still 484 leaves plenty of other ecological differences (e.g., tropical vs temperate, type of nest, 485 diet, etc) ignored. Fig. 1 does not provide any information about the rest of ecology 486 of the species.

488
This comment surprised us, as we did consider some of the effects of ecological and 489 demographic variables, whereas few studies on caste genomics do. We devote an 490 entire analysis to this very question. Fig. 1 includes the key ecological variables that 491 are thought to be important in wasp social behaviour and ecology. All social wasps 492 are generalist predators, with a preference for lepidopterans and dipterans. 493 Regarding the type of nest, the main difference is that the independent founding 494 Polistines build nests without an envelope, whilst all the others have envelopes. It is 495 not possible to examine the contribution of this nest type variable to caste-biased 496 gene expression because it confounded by both phylogeny and level of social 497 complexity. Regarding tropical/temperate: all the Polistines (which happen to also 498 be those with simpler societies) sampled are tropical and all the Vespines (which 499 happen to also be the only truly superorganismsal species) sampled are temperate; 500 so, again, these variables are confounded. However, the reviewer raises some valid 501 points and so we have made the following changes.

502
-We have revised Figure 1 to reflect these other life history traits; specifically, 503 we now add more of the life history traits to Figure 1 and expand the legend.

504
-To try to explore the possible effects of life-history a little further, we 505 conducted a FAMD plot (Factor analysis of mixed data; see below), showing 506 that queen and worker differences were the greatest explaining variable in the 507 dataset, with the other life-history variables not contributing as much to these 508 two top dimensions. We do not currently choose to include this in our revision, 509 as we feel that the text deals with this. However, if the reviewer or editor 510 would like us to, this can be added to the methods or supplementary 511 document. 512 513 Review Figure 1. Dimension reduction plot using FAMD (Factor analysis of mixed data).

515
Plot the dimension reduction to two components using all the orthologous genes in the nine 516 species (with species normalization step), then applying categorical data to the samples, 517 including reproductive (Queen) or not (Worker), plus nest-type, envelope type, nest-   This meant that from left (or 99% in x axis) to right (1%), the percentage of the 547 remaining features chosen using linear regression, for which we plotted the predicted 548 classification scores. Linear regression is picking out genes that are informative with 549 regard to caste, and the predictions therefore improve if we focus on the subset of 550 genes different between the castes in the training dataset. This of course will have 551 no impact on the test dataset, which is not used in the linear regression. We have 552 rewritten this methods and results section to better explain this, and in figure  553 legends. Rev 3 Comment 5.3 (lines 1344-1346) 554 555 What does 0.6 likelihood mean? Is this like probability? So 60% of samples would 556 likely be classified correctly? (This seems to be confirmed later in the Methods, but it 557 would have been useful while reading the results.) 558 559 We have changed this term in the manuscript now, only using the term "classification 560 estimate" (e.g. Rev 3 Comment 5.4, line 259), instead of likelihood, which is stated, 561 as we cannot take statistical likelihood directly from SVM models (although this can 562 be accomplished using Platt scaling; 563 https://home.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf), 564 so "classification estimate" is more appropriate and hopefully not misleading as 565 'likelihood' is in our case. We have changed this term throughout the paper and 566 expanded both the results and the methods section to fully explain this classification 567 score. We have also added a section in the SVM results, to clarify what the 0 and 1 568 scores mean in terms of worker and queen, which was lacking (Rev 3 Comment 569 5.5, lines 248-252). 570 571 572 Also, it seems like the method was used to see how well Queens could be classified. 573 Was this also performed for Workers? I wonder if Workers would be more or less 574 easy to classify. 575 576 Workers have exactly the opposite score of the queen for each prediction (e.g 0.9 in 577 queen will be -0,9 in worker, for the same species). This is due to our pooling of 578 replicates to one sample for each caste, so when we scale from -1 to 1, queens and 579 workers fit on the opposite scale. 580 581 Rev 3 Comment 6: I was more excited about the use of SVM to train a classifier on 582 species with simple societies and see how well it predicts caste in more complex 583 societies. Not withstanding my questions about classification confidence above, it 584 seems this test performed fairly well. It is intriguing that the reverse did not perform 585 well, which is what you might expect given hypotheses presented in previous 586 literature (e.g., Johnson Linksvayer 2010 Q Rev Biol). In the Discussion this result 587 was described as "unexpected", but it seems to me the Johnson & Linksvayer paper 588 suggests this is exactly what one would expect, given the degree of specialization 589 and emergent properties that emerge in the "superorganismal" species. 590 591 We However, I noted that few statistical results were provided in the text when 597 describing this result and looking at Fig. 5b the scores do not look that bad! It might 598 help bolster the conclusion if additional details were provided in the text. 599 600 Yes, we agree that we should have included some of the main numbers in the text 601 (now included; Rev 3 Comment 6.2, lines 350-351). Some of the predictions were 602 indeed not so bad in Figure 5b (inc. Mischocyttarus), but used a much smaller list of 603 the genes after filter selection to obtain this. This suggests that with much greater 604 levels of filtering you can still obtain semi-decent classifications. We have now 605 rewritten this section to highlight the fact that the simple society trained SVM used a 606 much larger gene set (~800), with high levels of correct classification compared to 607 the complex society trained SVM (~350), this is perhaps more informative about the 608 predictive ability of the two sets. (Rev 3 Comment 6.2, lines 350-351).

609
We have also improved the legend in Figure 5, to help with interpreting the plots.

REVIEWER COMMENTS
Reviewer #2 (Remarks to the Author): I am satisfied with the revisions. We are delighted that this Reviewer is happy with our revision. We thank them for their time, once again.
Reviewer #3 (Remarks to the Author): Thank you for your time helping us improve this manuscript.
The primary aim of the study is to determine whether there is a conserved genetic toolkit for sociality. The authors report that this toolkit does exist, but that other lineage-specific and life-history-specific factors matter too. This is not a terribly surprising conclusion, but it is important to describe it in this understudied group of social insects.
Comment 3.0 We should have stressed greater at the beginning that in general, wasps with simple societies are known to have very small numbers of differentially expressed genes between castes (a result also supported by our results). This meant that we were not certain that across so many species of wasps, that we would find a social toolkit. I have added this information in the introduction, as it was lacking, to highlight the fact that finding a social toolkit was not inevitable. We have added at line 123: ", known to have few caste-biased differential genes, potentially suggesting that a toolkit may not be present across wasps 5 ." One major question I have is about what it takes to determine if there is a conserved toolkit of not. Are the number of shared caste-biased genes across species greater than would be expected by chance? For example, if you randomly selected the same number of genes as were considered differentially expressed across species, how many times would there be 57 common to 4 or more species (Fig 3b)? It seems this kind of permutation test (or something similar) would have to be done before labeling the results as evidence for a conserved genetic toolkit. What if there were only 20, 10, or 5 genes differentially expressed across 4 species? Would this still be evidence for a shared toolkit? It seems some minimum number of genes would show up on multiple gene lists by chance, right? What is the null hypothesis? Alternatively (or perhaps concomitantly), you could look at the proportion of differentially expressed genes within a species and ask if more than expected by chance are shared with one or more other species. Thanks for raising this; it's important. We have now added the following analyses. First, we now show hypergeometric tests for significant pair-wise overlap of these differentially expressed genes across the nine species (see Fig 1 below; this could be added as a Supp. figure if the editor/reviewer prefers). Second, we performed a permutation test (as suggested, with 1000 permutations), taking all orthologous genes from Supplementary Table S3, and randomly selecting genes of the same number as were differentially expressed in each individual species, then checking how many genes overlapped between the 9 total species. With this, we could calculate a Fisher's exact test (two-sided) for our observed against expected. For example, in our study we observe 57 orthologs differentially expressed in 4 or more species, and calculated an average of 0.055 genes should be expected in 4 species (after 1000 permutations) by chance. We now show the expected, observed and Fisher's exact test in Figure 3, with a summary table below (Table 1); we have also added the method to the Methods section (L873, with the code on GitHub: https://github.com/Sumner-lab/Multispecies_paper_ML) and changed the figure legend on L1345. I also wondered about alternative explanations. In testing Hypothesis 2, could the reason for lower classification accuracy when going from complex to simple societies be related to phylogenetic history? The four species included in the complex society group are more distantly related to one another (and may also have more diverse ecological niche space (e.g. both temperate and tropical vs all tropical)?) than those included in the simple society groups. Could this also potentially explain the results of testing Hypothesis 5? (The swarm founders are more closely related than the independent founders.) I guess this is tested in Hypothesis 4, but perhaps the two sets of results could be more integrated to strengthen the story?
Comment 3.2 It is true that the four complex species represent a more diverse group, so this factor should be fully explored. Given there are only two vespids, we couldn't run SVM predictions based purely on these two vespids vs the other polistines as a training with 2 samples is not possible (mentioned in L380). However, we have tried repeating some of the SVMs using slightly different training sets to assess the effect of phylogeny of the predictions we report in the paper. First, we trained our model on four Polistines and testing on the fifth (Metapolybia, Mischocyttarus, Polistes, Polybia and Angiopolybia; removing the weakly supporting species Agelaia), to see if training within its own clade would improve the classifications of each test. This does not improve the classification in four out of the five Polistine species; Metapolybia was the only species with improved classifications. This suggests that our findings do not seem to be affected greatly by distance within the wasp family tree. However, we agree that it is important to acknowledge (and we do stress this at line 389 and 607 onward), that these analyses would benefit from additional replicate species of both simple and complex species in future studies to ensure that the findings are consistent. But, as a first attempt (and with the caveats made explicitly, as we already do in the discussion) we feel this is an important result and worth reporting here. The reviewer's point that Hyp 4 also provides some support for the fact that effects of phylogeny are less than caste. But integrating the two analyses together, we feel, would make the hypotheses less clearcut to understand. However, in the light of this we have switched round the hypotheses 3 (DnDs) with those of 4 (subfamily) and 5 (life history), so that the subfamily section follows on directly from Hyp 2.
Finally, this new data is updated in Supp. Table 6 (SVM of subsections, and its README), a new Supp. Figure 3 (Sub-family tests only), separate from Supp. Figure  4 (Life history SVMs), with Supp. Figure 5 being that of the DnDs analysis. Legends have been updated and references to these figures in the main text.
I appreciate the authors' justification and rationale for the lack of biological replicates within species and the use of SVM with small sample sizes. While I can appreciate the limitations in place, this does not change the fact that conclusions drawn from such small sample sizes are potentially unreliable.

Comment 3.3
The level of technical replication within species is small, we agree. However, the biological replication contributing to these samples is typical (or exceeds) that of most studies of this kind (mostly between 3 to 11 individuals in a pool). Prior to our study, studies on caste differentiation transcriptomics in wasps was limited to a few species from a single genus -Polistes, representing a single level of simple social complexity. Our study thus takes a big step forward in analysing data across nine species from two different subfamilies (there are only three subfamilies of social wasps), and nine different genera.
Nonetheless, the reviewer's concerns are valid -greater replication is always desirable; we had acknowledged and discussed this in the existing manuscript. E.g. line 347 ("future work should confirm this by expanding the repertoire of species")/ and in the discussion, line 485: ("Wider taxonomic sampling is required to determine the extent to which the role of these additional processes is driven by the evolution of superorganismality") and line 491: ("highlighting the importance of wider taxonomic consideration and replication in the quest to understand the molecular basis of sociality and the nature of the major transition"). See also response to Comment 3.2.
Comment 3.4 L165 -I would make it abundantly clear that you used a *single* biological replicate for queen and worker for each species.
L165, now makes this clear: It is not correct that we used a single biological replicate: we pooled many biological replicates for each caste for each species into a single technical replicate.
"These were sequenced as individual pools to generate a worker sample and a queen sample (both made up of between 3 and 11 biological replicates per caste pool). "